Linear Time Clustering for High Dimensional Mixtures of Gaussian Clouds

Authors

  • Dan Kushnir
  • Shirin Jalali
  • Iraj Saniee
Abstract

Clustering mixtures of Gaussian distributions is a fundamental and challenging problem that arises in many high-dimensional data processing tasks. While state-of-the-art work on learning Gaussian mixture models has focused primarily on improving separation bounds and generalizing them to arbitrary classes of mixture models, less attention has been paid to the practical computational efficiency of the proposed solutions. In this paper, we propose a novel and highly efficient clustering algorithm for n points drawn from a mixture of two arbitrary Gaussian distributions in R^p. The algorithm performs random 1-dimensional projections until it finds a direction that yields a user-specified clustering error e. For a 1-dimensional separation parameter γ satisfying γ = Q^{-1}(e), the expected number of such projections is shown to be bounded by o(ln p) when γ ≤ c √(ln ln p), where c is the separability parameter of the two Gaussians in R^p. Consequently, the expected overall running time of the algorithm is linear in n and quasi-linear in p, at o(ln p) · O(np), and the sample complexity is independent of p. This result stands in contrast to prior works, whose running times are polynomial, and at best quadratic, in p and n. We show that our bound on the expected number of 1-dimensional projections extends to the case of three or more Gaussian components, and we present a generalization of our results to mixture distributions beyond the Gaussian model.
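The projection loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how each 1-dimensional projection is split or how its separation is measured, so the two-means split (`split_1d`) and the acceptance score (gap between the two group means divided by the sum of their standard deviations) are assumptions made here for illustration. Q is assumed to be the standard Gaussian tail function, and all function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev


def q_inverse(e):
    """Return gamma such that Q(gamma) = e, assuming Q is the standard Gaussian tail."""
    return NormalDist().inv_cdf(1.0 - e)


def random_unit_vector(p, rng):
    """Draw a direction uniformly from the unit sphere in R^p."""
    v = [rng.gauss(0.0, 1.0) for _ in range(p)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def split_1d(values, iters=20):
    """Two-means on the line (an assumed splitting rule); returns a threshold."""
    t = 0.5 * (min(values) + max(values))
    for _ in range(iters):
        left = [x for x in values if x <= t]
        right = [x for x in values if x > t]
        if not left or not right:
            break
        t = 0.5 * (fmean(left) + fmean(right))
    return t


def cluster_by_random_projection(points, e, max_tries=1000, seed=0):
    """Project onto random directions until the 1-D split looks separated enough.

    Returns (labels, tries); labels is None if no direction was accepted.
    """
    rng = random.Random(seed)
    gamma = q_inverse(e)  # target 1-D separation for error e
    p = len(points[0])
    for tries in range(1, max_tries + 1):
        u = random_unit_vector(p, rng)
        z = [sum(a * b for a, b in zip(x, u)) for x in points]  # O(np) per try
        t = split_1d(z)
        left = [v for v in z if v <= t]
        right = [v for v in z if v > t]
        if len(left) < 2 or len(right) < 2:
            continue
        spread = pstdev(left) + pstdev(right)
        if spread == 0.0:
            continue
        # Assumed acceptance test: estimated separation in units of spread.
        if (fmean(right) - fmean(left)) / spread >= gamma:
            return [int(v > t) for v in z], tries
    return None, max_tries
```

Each attempt costs O(np) for the projection, which matches the per-projection cost in the abstract's running-time expression. The separation score is estimated from the two halves after splitting, so it is somewhat optimistic; a faithful implementation would follow the criterion defined in the paper itself.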

Similar resources

Efficient Sparse Clustering of High-Dimensional Non-spherical Gaussian Mixtures

We consider the problem of clustering data points in high dimensions, i.e., when the number of data points may be much smaller than the number of dimensions. Specifically, we consider a Gaussian mixture model (GMM) with two non-spherical Gaussian components, where the clusters are distinguished by only a few relevant dimensions. The method we propose is a combination of a recent approach for le...


Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation

While several papers have investigated computationally and statistically efficient methods for learning Gaussian mixtures, precise minimax bounds for their statistical performance as well as fundamental limits in high-dimensional settings are not well-understood. In this paper, we provide precise information theoretic bounds on the clustering accuracy and sample complexity of learning a mixture...


High-Dimensional Unsupervised Active Learning Method

In this work, a hierarchical ensemble of projected clustering algorithm for high-dimensional data is proposed. The basic concept of the algorithm is based on the active learning method (ALM) which is a fuzzy learning scheme, inspired by some behavioral features of human brain functionality. High-dimensional unsupervised active learning method (HUALM) is a clustering algorithm which blurs the da...


Adaptive Mixtures of Factor Analyzers

A mixture of factor analyzers is a semi-parametric density estimator that generalizes the well-known mixtures of Gaussians model by allowing each Gaussian in the mixture to be represented in a different lower-dimensional manifold. This paper presents a robust and parsimonious model selection algorithm for training a mixture of factor analyzers, carrying out simultaneous clustering and locally l...


A New Algorithm in Blind Source Separation for High Dimensional Data Sets Such as Meg Data

BSS is one of the well-known methods of signal processing. This method is based on recovering the original sources from observed mixtures without any further information about the mixing system or the original sources. In many applications, mixtures are combinations of non-Gaussian and time-correlated components. The MCOMBI algorithm is known as a method for separating these kinds of sources. The perform...


Journal:
  • CoRR

Volume: abs/1712.07242

Publication date: 2017